TrTok: A Fast and Trainable Tokenizer for Natural Languages
نویسندگان
چکیده
We present a universal data-driven tool for segmenting and tokenizing text. The presented tokenizer lets the user define where token and sentence boundaries should be considered. These instances are then judged by a classifier which is trained from provided tokenized data. The features passed to the classifier are also defined by the user making, e.g., the inclusion of abbreviation lists trivial. This level of customizability makes the tokenizer a versatile tool which we show is capable of sentence detection in English text as well as word segmentation in Chinese text. In the case of English sentence detection, the system outperforms previous methods. The software is available as an open-source project on GitHub1.
منابع مشابه
A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization
Tokenization is one of the initial steps done for almost any text processing task. It is not particularly recognized as a challenging task for English monolingual systems but it rapidly increases in complexity for systems that apply it for different languages. This article proposes a supervised learning approach to perform the tokenization task. The method presented in this article is based on ...
متن کاملMultilingual number transcription for text-to-speech conversion
This paper describes the text normalization module of a text to speech fully-trainable conversion system and its application to number transcription. The main target is to generate a language independent text normalization module, based on data instead of on expert rules. This paper proposes a general architecture based on statistical machine translation techniques. This proposal is composed of...
متن کاملTokenizer Adapted for the Nasa Yuwe Language
In Colombia, ethnic and cultural diversity is conceived by the government to be a social right. Such diversity finds expression, among other ways, in a large number of indigenous languages, which have been kept alive for centuries. However, efforts toward conservation and preservation of these languages have generally fallen short. This is the case for the Nasa Yuwe language, spoken by the Nasa...
متن کاملAprendizaje Atomático para el Reconocimiento Temporal Multilinge basado en TiMBL
This paper presents a Machine Learning-based system for temporal expression recognition. The system uses the TiMBL application, which is a memorybasedmachine learning system. The portability of the system to other new languages has a very low cost, because it does not need any dependent language resource (only requires a tokenizer and a POS tagger, although the lack in POS tagger does not have ...
متن کاملمقایسه روش های طیفی برای شناسایی زبان گفتاری
Identifying spoken language automatically is to identify a language from the speech signal. Language identification systems can be divided into two categories, spectral-based methods and phonetic-based methods. In the former, short-time characteristics of speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Prague Bull. Math. Linguistics
دوره 98 شماره
صفحات -
تاریخ انتشار 2012